Employee Job Attrition

AE Rodriguez & Mo Cayer
March 1, 2019

Watson Analytics Sample Data: HR Employee Data Attrition & Performance

OBJECTIVES

ACME Inc. has an attrition problem. You have been asked to find out why, to determine how to stem it, and to identify the employees inclined to leave.

**Objective 1: Identify the “reasons” underlying the employee exodus by developing a machine learning model**

  + We will be able to explore important questions such as:

  + "Show me a breakdown of distance from home by job role and attrition."

  + "Compare average monthly income by education and attrition."

  + We will be able to identify the most important reasons behind
  an employee's departure.

**Objective 2: Use the model to “stem the tide” of employees heading for the exits**

This is a fictional data set created by IBM data scientists.
Source: the IBM Watson Analytics Lab website.

Data Structure: All Variables

 [1] "Age"                      "Attrition"               
 [3] "BusinessTravel"           "DailyRate"               
 [5] "Department"               "DistanceFromHome"        
 [7] "Education"                "EducationField"          
 [9] "EnvironmentSatisfaction"  "Gender"                  
[11] "HourlyRate"               "JobInvolvement"          
[13] "JobLevel"                 "JobRole"                 
[15] "JobSatisfaction"          "MaritalStatus"           
[17] "MonthlyIncome"            "MonthlyRate"             
[19] "NumCompaniesWorked"       "OverTime"                
[21] "PercentSalaryHike"        "PerformanceRating"       
[23] "RelationshipSatisfaction" "StockOptionLevel"        
[25] "TotalWorkingYears"        "TrainingTimesLastYear"   
[27] "WorkLifeBalance"          "YearsAtCompany"          
[29] "YearsInCurrentRole"       "YearsSinceLastPromotion" 
[31] "YearsWithCurrManager"

Structure

Variables in DataSet, Examples

  Attrition     Education EducationField EnvironmentSatisfaction
1       Yes       College  Life_Sciences                  Medium
2        No Below_College  Life_Sciences                    High
4       Yes       College          Other               Very_High
5        No        Master  Life_Sciences               Very_High
7        No Below_College        Medical                     Low
8        No       College  Life_Sciences               Very_High

  Gender JobLevel JobSatisfaction MaritalStatus MonthlyIncome
1 Female        2       Very_High        Single          5993
2   Male        2          Medium       Married          5130
4   Male        1            High        Single          2090
5 Female        1            High       Married          2909
7   Male        1          Medium       Married          3468
8   Male        1       Very_High        Single          3068
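
The question "Compare average monthly income by education and attrition" amounts to a grouped average. The following is a minimal sketch in Python using only the six sample rows shown above; the function name `mean_income_by_group` is an illustrative assumption, and six rows are far too few to draw conclusions from.

```python
from collections import defaultdict

# The six sample rows shown above: (Attrition, Education, MonthlyIncome).
rows = [
    ("Yes", "College", 5993),
    ("No", "Below_College", 5130),
    ("Yes", "College", 2090),
    ("No", "Master", 2909),
    ("No", "Below_College", 3468),
    ("No", "College", 3068),
]


def mean_income_by_group(records):
    """Average MonthlyIncome for each (Education, Attrition) pair."""
    totals = defaultdict(lambda: [0, 0])  # key -> [running sum, row count]
    for attrition, education, income in records:
        bucket = totals[(education, attrition)]
        bucket[0] += income
        bucket[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


income_means = mean_income_by_group(rows)
for (education, attrition), avg in sorted(income_means.items()):
    print(f"{education:<14} {attrition:<4} {avg:8.1f}")
```

The same grouping pattern would apply to "distance from home by job role and attrition" by swapping the key and value columns.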

Structure

Variables in DataSet, Examples continued

  OverTime PercentSalaryHike RelationshipSatisfaction
1      Yes                11                      Low
2       No                23                Very_High
4      Yes                15                   Medium

  TotalWorkingYears TrainingTimesLastYear WorkLifeBalance
1                 8                     0             Bad
2                10                     3          Better
4                 7                     3          Better

  YearsAtCompany YearsSinceLastPromotion YearsWithCurrManager
1              6                       0                    5
2             10                       1                    7
4              0                       0                    0

Exploratory Data Analysis

Correlation Plot 1: All Variables

[Figure: correlation plot 1, all variables]

Correlation Plot 2: All Variables

[Figure: correlation plot 2, all variables]

Overtime, Age, & Marital Status

[Figure: overtime, age, and marital status]

Monthly Income, Age, & Marital Status

[Figure: monthly income, age, and marital status]

Hourly Rate v. Age & Gender

[Figure: hourly rate v. age and gender]

Distance from Home v. Job Level

[Figure: distance from home v. job level]

Overtime, Age & Marital Status

[Figure: overtime, age, and marital status]

Exploratory Data Analysis

Identify Significant Predictors Using Random Forests

[Figure: significant predictors identified using random forests]

Training a Model

Random Forests

Confusion Matrix and Statistics

          Reference
Prediction   No  Yes
       No  1233    0
       Yes    0  237

               Accuracy : 1          
                 95% CI : (0.9975, 1)
    No Information Rate : 0.8388     
    P-Value [Acc > NIR] : < 2.2e-16  

                  Kappa : 1          
 Mcnemar's Test P-Value : NA         

            Sensitivity : 1.0000     
            Specificity : 1.0000     
         Pos Pred Value : 1.0000     
         Neg Pred Value : 1.0000     
             Prevalence : 0.8388     
         Detection Rate : 0.8388     
   Detection Prevalence : 0.8388     
      Balanced Accuracy : 1.0000     

       'Positive' Class : No
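
Two of the figures above can be reproduced by hand from the matrix counts. The sketch below is in Python and assumes the reported interval is an exact binomial (Clopper-Pearson) interval, which is the usual convention for this style of output; the variable names are illustrative.

```python
# Counts read from the confusion matrix above (rows = prediction, columns = reference).
n_no, n_yes = 1233, 237
n = n_no + n_yes          # 1470 evaluated rows
correct = 1233 + 237      # every row sits on the diagonal

# No Information Rate: the accuracy of always predicting the majority class.
nir = max(n_no, n_yes) / n

# Exact 95% interval when all n predictions are correct:
# the upper limit is 1 and the lower limit solves p**n = alpha / 2.
alpha = 0.05
lower = (alpha / 2) ** (1 / n) if correct == n else None

print(f"No Information Rate = {nir:.4f}")
print(f"95% CI lower bound  = {lower:.4f}")
```

Both values agree with the output above (0.8388 and 0.9975). The matrix covers all 1470 rows with no errors, so the No Information Rate is the more useful baseline when reading the other models' accuracy.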

Random Forest Model Sensitivity (Partial Dependence)

Monthly Income

[Figure: random forest partial dependence on monthly income]

Age

[Figure: random forest partial dependence on age]
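
As a rough illustration of what a partial dependence curve computes: for each value on a grid, the chosen feature is set to that value in every row, and the model's predictions are averaged. The Python sketch below uses a hypothetical stand-in model (`toy_leave_probability`, with invented coefficients), not the deck's fitted random forest. The three employees' Age and MonthlyIncome values are taken from the prediction table at the end of this deck.

```python
import math


def partial_dependence(predict, rows, feature, grid):
    """Average prediction over `rows` with `feature` forced to each grid value."""
    curve = []
    for value in grid:
        # Copy each row with the feature overridden; the originals are untouched.
        preds = [predict({**row, feature: value}) for row in rows]
        curve.append((value, sum(preds) / len(preds)))
    return curve


# HYPOTHETICAL stand-in for a fitted model: the coefficients are invented for
# illustration and are not estimates from the deck's models.
def toy_leave_probability(row):
    score = 1.0 - 0.0002 * row["MonthlyIncome"] - 0.03 * row["Age"]
    return 1 / (1 + math.exp(-score))


# Age and MonthlyIncome of three employees from the deck's prediction table.
employees = [
    {"MonthlyIncome": 9854, "Age": 28},
    {"MonthlyIncome": 16015, "Age": 41},
    {"MonthlyIncome": 2720, "Age": 30},
]

income_curve = partial_dependence(
    toy_leave_probability, employees, "MonthlyIncome", [2000, 6000, 10000, 14000]
)
for income, avg_prob in income_curve:
    print(f"MonthlyIncome={income:>6}: mean predicted P(leave)={avg_prob:.3f}")
```

Because the toy model's income coefficient is negative, the curve falls as income rises. The shape of the real curves depends entirely on the fitted model.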

Training a Model

Logistic Regression

Confusion Matrix and Statistics

          Reference
Prediction   No  Yes
       No  1215  202
       Yes   18   35

               Accuracy : 0.8503          
                 95% CI : (0.8311, 0.8682)
    No Information Rate : 0.8388          
    P-Value [Acc > NIR] : 0.1203          

                  Kappa : 0.1939          
 Mcnemar's Test P-Value : <2e-16          

            Sensitivity : 0.9854          
            Specificity : 0.1477          
         Pos Pred Value : 0.8574          
         Neg Pred Value : 0.6604          
             Prevalence : 0.8388          
         Detection Rate : 0.8265          
   Detection Prevalence : 0.9639          
      Balanced Accuracy : 0.5665          

       'Positive' Class : No
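
The per-class statistics above follow directly from the four cell counts once the positive class is fixed as "No". The following Python sketch shows the arithmetic; the function name `class_metrics` and the dictionary keys are illustrative assumptions.

```python
def class_metrics(tp, fn, fp, tn):
    """Statistics for a 2x2 confusion matrix, with counts relative to the positive class."""
    n = tp + fn + fp + tn
    sensitivity = tp / (tp + fn)   # share of actual positives predicted positive
    specificity = tn / (tn + fp)   # share of actual negatives predicted negative
    return {
        "accuracy": (tp + tn) / n,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "pos_pred_value": tp / (tp + fp),
        "neg_pred_value": tn / (tn + fn),
        "prevalence": (tp + fn) / n,
        "detection_rate": tp / n,
        "detection_prevalence": (tp + fp) / n,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Positive class is "No": stayers predicted to stay are the true positives,
# and leavers predicted to leave are the true negatives.
logit = class_metrics(tp=1215, fn=18, fp=202, tn=35)
for name, value in logit.items():
    print(f"{name:<22} {value:.4f}")
```

Read this way, the specificity of 0.1477 means the logistic regression flags only 35 of the 237 employees who actually left, even though overall accuracy is 0.8503.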

Logistic Regression Model Sensitivity (Partial Dependence)

Age & Years at the Company

[Figure: logistic regression partial dependence on age and years at the company]

Monthly Income & Distance From Home

[Figure: logistic regression partial dependence on monthly income and distance from home]

Training a Model

Naive Bayes

Confusion Matrix and Statistics

          Reference
Prediction   No  Yes
       No  1122  133
       Yes  111  104

               Accuracy : 0.834          
                 95% CI : (0.814, 0.8527)
    No Information Rate : 0.8388         
    P-Value [Acc > NIR] : 0.7046         

                  Kappa : 0.3624         
 Mcnemar's Test P-Value : 0.1788         

            Sensitivity : 0.9100         
            Specificity : 0.4388         
         Pos Pred Value : 0.8940         
         Neg Pred Value : 0.4837         
             Prevalence : 0.8388         
         Detection Rate : 0.7633         
   Detection Prevalence : 0.8537         
      Balanced Accuracy : 0.6744         

       'Positive' Class : No
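
Kappa and the McNemar p-value can also be recomputed from the four cells. The Python sketch below assumes the McNemar test uses the common continuity correction; the function names are illustrative.

```python
import math


def cohen_kappa(a, b, c, d):
    """Kappa for the 2x2 table [[a, b], [c, d]] (rows = prediction, columns = reference)."""
    n = a + b + c + d
    observed = (a + d) / n
    # Agreement expected by chance from the row and column totals.
    expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2
    return (observed - expected) / (1 - expected)


def mcnemar_p(b, c):
    """Continuity-corrected McNemar test on the two off-diagonal counts."""
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # Upper tail of a chi-squared distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(stat / 2))


# Naive Bayes matrix from the output above.
nb_kappa = cohen_kappa(1122, 133, 111, 104)
nb_mcnemar = mcnemar_p(133, 111)

print(f"Kappa          = {nb_kappa:.4f}")
print(f"McNemar's test = {nb_mcnemar:.4f}")
```

Under these assumptions the results agree with the reported Kappa (0.3624) and McNemar p-value (0.1788). The two kinds of error, 133 and 111, are roughly balanced, which is why the McNemar test is not significant here.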

Model Comparison

Comparing Models

[Figure: comparison of the three models]
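
One way to put the three models side by side is to tabulate the statistics already reported in their confusion matrix outputs. The Python sketch below copies those figures as printed; the dictionary layout and the `rank_by` helper are illustrative assumptions and do not reproduce the comparison plot.

```python
# Figures copied from the three "Confusion Matrix and Statistics" outputs.
models = {
    "Random Forests":      {"accuracy": 1.0000, "kappa": 1.0000, "balanced_accuracy": 1.0000},
    "Logistic Regression": {"accuracy": 0.8503, "kappa": 0.1939, "balanced_accuracy": 0.5665},
    "Naive Bayes":         {"accuracy": 0.8340, "kappa": 0.3624, "balanced_accuracy": 0.6744},
}
NO_INFORMATION_RATE = 0.8388  # accuracy of always predicting "No"


def rank_by(metric):
    """Model names ordered from best to worst on the given metric."""
    return sorted(models, key=lambda name: models[name][metric], reverse=True)


print(f"{'Model':<20} {'Accuracy':>9} {'Kappa':>7} {'Bal. Acc.':>10} {'vs NIR':>8}")
for name in rank_by("balanced_accuracy"):
    stats = models[name]
    gap = stats["accuracy"] - NO_INFORMATION_RATE
    print(
        f"{name:<20} {stats['accuracy']:>9.4f} {stats['kappa']:>7.4f} "
        f"{stats['balanced_accuracy']:>10.4f} {gap:>+8.4f}"
    )
```

Ranking by accuracy puts logistic regression ahead of Naive Bayes, while ranking by Kappa or balanced accuracy reverses that order. With 83.88% of employees in the "No" class, raw accuracy alone says little about how well a model identifies leavers.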

Prediction

Who is likely to leave?

     Attrition MonthlyIncome Age OverTime DailyRate TotalWorkingYears
1933       Yes          9854  28      Yes      1475                 6
410         No         16015  41       No       334                22
198         No          2720  30       No      1427                 6
                   JobRole MonthlyRate HourlyRate DistanceFromHome
1933       Sales_Executive       23352         84               13
410                Manager       15896         88                2
198  Laboratory_Technician       11162         35                2
     YearsAtCompany
1933              2
410              22
198               5

1933  410  198 
 Yes   No   No 
Levels: No Yes